4. Protecting Data at Rest
Now that the transfer of data over the wire is secure, the next step is
to secure the data when it has reached Microsoft’s servers. A reasonable
question is, why bother? SSL protects against anyone snooping or modifying
traffic over the wire. Microsoft implements various security practices
(from physical to technological) to protect your data once it has reached
its data centers. Isn’t this sufficient security?
For most cases, the answer is “yes.” It falls short
only when you have highly sensitive data (health
data, for example) that is subject to regulations and laws governing how it is
stored. Though you may be secure enough in practice, you might still need
to encrypt your data to comply with some regulation.
Once you’ve decided that having your data in the clear in
Microsoft’s data centers isn’t acceptable (and you’ve taken into account
the performance overhead of encrypting it), what do you do?
Cryptography is very dangerous. It is the technological equivalent of
getting someone drunk on tequila and then giving him a loaded bazooka
that can fire forward and backward. It (cryptography, not the fictitious
bazooka) is fiendishly difficult to get right, and most blog posts/book
samples get it wrong in some small, but nevertheless devastating,
way. The only known way to ensure that an application is
cryptographically sound is to do a thorough analysis of the
cryptographic techniques it uses, and to let experts review/attack the
application for a long time (sometimes years on end). Some of the
widespread cryptographic products (be it the Windows crypto stack or
OpenSSL) are so good because of the attention and analysis they’ve
received. With this in mind, you should reuse some well-known code/library
for all your cryptographic needs. For example, GPGME is a great library
for encrypting files. If you’re rolling your own implementation, ensure
that you have professional cryptographers validate what you’re
doing. The code and techniques shown in this chapter should be sound. You
can use them as a starting point for your own implementation, or to help
you understand how other implementations work. However, you shouldn’t
trust and reuse the code presented here directly in a production
application for the simple reason that it hasn’t undergone thorough
scrutiny from a legion of experts.
The goal here will be to achieve two things with any data you back
up with the service presented in this chapter. The first is to encrypt
data so that only the original user can decrypt it. The second is to
digitally sign the encrypted data so that any modification is
automatically detected.
4.1. Understanding the Basics of Cryptography
To thoroughly understand how this will be accomplished, you must first
become familiar with some basics of cryptography. Entire books have been
written on cryptography, so this is nothing more than the most fleeting
of introductions, meant more to jog your memory than anything
else.
Note: If you’ve never heard of these terms, you should spend a
leisurely evening (or two…or several) reading up on them before
writing any cryptography code. Unlike a lot of programming where a
coder can explore and copy/paste code from the Web and get away with
it, cryptography and security are places where not having a solid
understanding of the fundamentals can bite you when you least expect
it. To quote the old G.I. Joe advertising slogan, “Knowing is half the
battle.” The other half is probably reusing other people’s
tried-and-tested crypto code whenever you can.
4.1.1. Encryption/decryption
When the term encryption is used in this chapter, it refers
to the process of converting data (plaintext) using an algorithm into
a form (ciphertext) in which it is unreadable without
possession of a key. Decryption is the reverse of this operation, in which a
key is used to convert ciphertext back into plaintext.
4.1.2. Symmetric key algorithms
A symmetric key algorithm is one that uses the same key for both
encryption and decryption. Popular examples are the Advanced
Encryption Standard (AES, also known as Rijndael), Twofish, Serpent, and
Blowfish. A major advantage of using symmetric algorithms is that
they’re quite fast. However, this gets tempered with the disadvantage
that both parties (the one doing the encryption and the one doing the
decryption) need to know the same key.
4.1.3. Asymmetric key algorithms (public key cryptography)
An asymmetric key algorithm is one in which the key used for encryption is different
from the one used for decryption. The major advantage is, of course,
that the party doing the encryption doesn’t need to have access to the
same key as the party doing the decryption.
Typically, each user has a pair of cryptographic keys: a
public key and a private
key. The public key may be widely distributed, but the
private key is kept secret. Though the keys are related
mathematically, the security of these algorithms depends on the fact
that by knowing only one key, it is impossible (or at least
infeasible) to derive the other.
Messages are encrypted with the recipient’s public key, and can
be decrypted only with the associated private key. You can use this
process in reverse to digitally sign data. The sender encrypts a hash
of the data with his private key, and the recipient can decrypt the
hash using the public key, and verify whether it matches a hash the
recipient computes.
The major disadvantage of public key cryptography is that it is
typically highly computationally intensive, and it is often
impractical to encrypt large amounts of data this way. One common
cryptographic technique is to use a symmetric key for quickly
encrypting data, and then encrypting the symmetric key (which is quite
small) with an asymmetric key algorithm. Popular asymmetric key
algorithms include RSA, ElGamal, and others.
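To make the relationship between the two keys concrete, here is a toy sketch of textbook RSA using tiny demonstration primes (61 and 53). It is not how azbackup or any real library works: real RSA uses keys of 2,048 bits or more together with padding, and every function name below is hypothetical. The sketch only illustrates that what one key does, the other undoes, for both encryption and signing.

```python
import hashlib

# Toy parameters only; real keys use primes hundreds of digits long.
p, q = 61, 53
n = p * q                  # modulus shared by both keys
phi = (p - 1) * (q - 1)
e = 17                     # public exponent
d = 2753                   # private exponent
assert (e * d) % phi == 1  # d is the inverse of e modulo phi


def toy_encrypt(m, public_key):
    exp, mod = public_key
    return pow(m, exp, mod)


def toy_decrypt(c, private_key):
    exp, mod = private_key
    return pow(c, exp, mod)


def toy_hash(data):
    # Reduce a SHA-256 digest modulo n so it fits the toy key.
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % n


def toy_sign(data, private_key):
    # Signing applies the private key to a hash of the data.
    exp, mod = private_key
    return pow(toy_hash(data), exp, mod)


def toy_verify(data, signature, public_key):
    # Verification applies the public key and compares hashes.
    exp, mod = public_key
    return pow(signature, exp, mod) == toy_hash(data)


public_key, private_key = (e, n), (d, n)

# Encrypt with the public key, decrypt with the private key.
ciphertext = toy_encrypt(65, public_key)
recovered = toy_decrypt(ciphertext, private_key)

# Sign with the private key, verify with the public key.
signature = toy_sign(b"backup archive", private_key)
signature_ok = toy_verify(b"backup archive", signature, public_key)
```

With a modulus this small the scheme is trivially breakable, which is exactly why production code should lean on a reviewed library instead.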
4.1.4. Cryptographic hash
A cryptographic hash function is one that takes an arbitrary
block of data and returns a fixed set of bytes. This sounds just like
a normal hash function such as the one you would use in a HashTable, correct? Not quite.
To be considered a cryptographically strong hash function, the
algorithm must have a few key properties. It should be infeasible to
find two messages with the same hash, or to change a message without
changing its hash, or to determine contents of the message given its
hash. Several of these algorithms are in wide use today. As of this
writing, the current state-of-the-art algorithms are those in the
SHA-2 family, and older algorithms such as MD5 and SHA-1 should be
considered insecure.
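As a small illustration of these properties, the following sketch uses SHA-256 (a member of the SHA-2 family) from Python’s standard hashlib module. The sample messages are made up; the point is only that the digest has a fixed size regardless of input length, and that a small change to the message produces a different digest.

```python
import hashlib

message = b"attack at dawn"
tampered = b"attack at dusk"

# Hex digests of the original, a slightly changed copy,
# and a much longer input.
digest = hashlib.sha256(message).hexdigest()
tampered_digest = hashlib.sha256(tampered).hexdigest()
long_digest = hashlib.sha256(message * 1000).hexdigest()

print(digest)
print(tampered_digest)
print(len(digest), len(long_digest))
```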
With that short introduction to cryptography terminology, let’s
get to the real meat of what you will do with azbackup: encrypt data.
4.2. Determining the Encryption Technique
The first criterion in picking an encryption technique is to
ensure that someone getting access to the raw data on the cloud can’t
decrypt it. This means not only do you need a strong algorithm, but also
you must keep the key you use to encrypt data away from the cloud.
Actual encryption and decryption won’t happen in the cloud—it’ll happen
in whichever machine talks to the cloud using your code. By keeping the
key in a physically different location, you ensure that an attack on the
cloud alone can’t compromise your data.
The second criterion in picking a design is to have different
levels of access. In short, you can have machines that are trusted to
back up data, but aren’t trusted to read backups.
A common scenario is to have a web server backing up logfiles, so
it must have access to a key to encrypt data. However, you don’t trust
the web server with the ability to decrypt all your data. To do this,
you will use public key cryptography. The public key portion of the key
will be used to encrypt data, and the private key will be used to
decrypt backups. You can now have the public key on potentially insecure
machines doing backups, but keep your private key (which is essentially
the keys to the kingdom) close to your chest.
You’ll be using RSA with 2,048-bit keys as the asymmetric key algorithm.
There are several other options to choose from (such as ElGamal), but
RSA is as good an option as any other, as long as you are careful to use
it in the way it was intended. As of this writing, 2,048 bits is the
recommended length for keys given current computational power.
Note: Cryptographers claim that 2,048-bit keys are secure until 2030.
In comparison, 1,024-bit keys are expected to be cracked by
2011.
Since the archives azbackup
works on are typically very large, you can’t encrypt
them directly using RSA. You’ll generate a symmetric key unique to every
archive (typically called the session key, though there is no session
involved here), and use a block cipher to encrypt the actual data
with that symmetric key. To do this, you’ll be using AES with 256-bit keys.
Again, there are several choices,
but AES is widely used, and as of this writing, 256 bits is the optimum
key length.
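Generating a fresh session key per archive can be sketched with nothing more than the operating system’s random source. This is a minimal sketch, not azbackup’s own code: the helper name new_session_key and the constant are hypothetical, and the AES encryption itself is left to a real crypto library.

```python
import os

AES_256_KEY_BYTES = 32  # 256 bits


def new_session_key():
    """Return a fresh random key sized for AES-256 (hypothetical helper)."""
    # os.urandom draws from the operating system's secure random source.
    return os.urandom(AES_256_KEY_BYTES)


# Every archive gets its own key, so one leaked session key
# exposes only a single archive.
key_for_archive_a = new_session_key()
key_for_archive_b = new_session_key()
```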
Since you will use RSA to encrypt the per-archive key, you might
as well use the same algorithm to sign the archives. Signing essentially
takes the cryptographic hash of the encrypted data, and then encrypts
the hash using a private RSA key. Cryptographers frown on using the
same key for both encryption and signing, so you’ll generate another RSA
key pair to do this.
Don’t worry if all of this sounds a bit heavy. The actual code to
do all this is quite simple and, more importantly, small.
Note: You might have noticed that the Windows Azure storage account
key hasn’t been mentioned anywhere here. Many believe that public key
cryptography is actually better for super-sensitive,
government-regulated data
because no one but you (not even Microsoft) has the key to get at the
plaintext version of your data. However, the storage account key does
add another layer of defense to your security. If others can’t get
access to it, they can’t get your data.
4.3. Generating Keys
Let’s take a look at some code. Earlier, you learned that for
the sample application you will be using two RSA keys: one for
encrypting session keys for each archive, and one for signing the
encrypted data. These keys will be stored in one file, and will be
passed on the command line to azbackup. Since you can’t expect users to
have a couple of RSA keys lying around, you will need to provide a
utility to generate them.
Since there’s a fair bit of crypto implementation in azbackup, it is bundled together in a module
called crypto, with its implementation
in crypto.py. You’ll learn about
key pieces of code in this module as this discussion progresses.
Example 3 shows the code
for the key-generation utility (creatively titled azbackup-gen-key.py). By itself, it isn’t very
interesting. All it does is take a command-line parameter
(keyfile) giving the path to write
the keys to, and then call the crypto
module to do the actual key generation.
Example 3. The azbackup-gen-key.py utility
#!/usr/bin/env python
"""
azbackup-gen-key
Generates two 2048 bit RSA keys and stores them in keyfile

Call it like this: azbackup-gen-key.py -k keyfile
"""
import sys
import optparse
import crypto


def main():
    # Parse command-line options
    optp = optparse.OptionParser(__doc__)
    optp.add_option("-k", "--keyfile", action="store",
                    type="string", dest="keyfile", default=None)
    (options, args) = optp.parse_args()

    # A key file path is required; show usage if it is missing
    if options.keyfile is None:
        optp.print_help()
        return

    crypto.generate_rsa_keys(options.keyfile)


if __name__ == '__main__':
    main()
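The optparse module has since been deprecated in favor of argparse. As a hedged sketch of how the same option handling might look with argparse (the function name parse_keyfile and the help text are hypothetical, and this is not part of the book’s utility):

```python
import argparse


def parse_keyfile(argv=None):
    """Parse the command line and return the key file path."""
    parser = argparse.ArgumentParser(
        description="Generates two 2048 bit RSA keys and stores them in keyfile")
    # required=True makes argparse print usage and exit when -k is missing,
    # which replaces the manual check for None.
    parser.add_argument("-k", "--keyfile", required=True,
                        help="path of the key file to create")
    return parser.parse_args(argv).keyfile


# Passing an explicit list stands in for sys.argv[1:] here.
keyfile = parse_keyfile(["--keyfile", "foo.key"])
```

The returned path would then be handed to crypto.generate_rsa_keys exactly as in Example 3.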
The real work is done by crypto.generate_rsa_keys. The implementation
for that method lies in the crypto
module. Let’s first see the code in Example 4, and then examine how it works.
Example 4. Crypto generation of RSA keys
import sys

try:
    import M2Crypto
    from M2Crypto import EVP, RSA, BIO
except ImportError:
    print("Couldn't import M2Crypto. Make sure it is installed.")
    sys.exit(-1)


def generate_rsa_keys(keyfile):
    """ Generates two 2048 bit RSA keys and stores them sequentially
    (encryption key first, signing key second) in keyfile """
    # Generate the encryption key and store it in bio
    bio = BIO.MemoryBuffer()
    generate_rsa_key_bio(bio)

    # Generate the signing key and append it to the same bio
    generate_rsa_key_bio(bio)

    # Write both keys out; the with block closes the file
    with open(keyfile, 'wb') as key_output:
        key_output.write(bio.read())


def generate_rsa_key_bio(bio, bits=2048, exponent=65537):
    """ Generates an RSA key (2048 bits by default) and writes it to bio.
    Use 65537 as the default exponent since the use of 3 might have
    some weaknesses """
    def callback(*args):
        # Progress notifications are ignored
        pass
    keypair = RSA.gen_key(bits, exponent, callback)
    # None means the key is saved without passphrase encryption
    keypair.save_key_bio(bio, None)
If you aren’t familiar with M2Crypto or OpenSSL programming, the
code shown in Example 4 probably
looks like gobbledygook. The first few lines import the M2Crypto module
and import a few specific public classes inside that module. This is
wrapped in a try/except exception handler so that you can print
a nice error message in case the import fails. This is a simple way to
check whether M2Crypto is correctly installed on the machine.
The three classes you are importing are EVP, RSA,
and BIO. EVP (which is actually an acronym formed from
the words “Digital EnVeloPe”) is a high-level interface to all the
cipher suites supported by OpenSSL. It essentially provides support for
encrypting and decrypting data using a wide range of algorithms.
RSA, as the name suggests, is a
wrapper around the OpenSSL RSA implementation. This provides support for
generating RSA keys and encryption/decryption using RSA. Finally,
BIO (short for
“Basic Input/Output”) is an I/O abstraction used by OpenSSL. Think of
it as the means by which you can send and get byte arrays from
OpenSSL.
The action kicks off in generate_rsa_keys. This calls out to generate_rsa_key_bio to generate each
RSA public/private key pair, and then writes them into the key file. A single
BIO.MemoryBuffer object is allocated. This is
the byte buffer into which generate_rsa_key_bio writes both RSA
keys, one after the other.
The key file’s format is fairly trivial. It contains the RSA key
pair used for encryption, followed by the RSA key pair used for
signing. There is no particular reason to use this order or format.
You could just as easily design a different file format or, if you are feeling
really evil, you could put the contents in an XML file. Doing it this
way keeps things simple and makes it easy to read out the keys again. If
you ever need keys of different sizes or types, you will need to revisit
this format.
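Reading the keys back out of such a file can be sketched as follows, assuming each key is written as a PEM text block delimited by BEGIN and END marker lines. The function name split_pem_blocks is hypothetical, the sample bodies are placeholders rather than real keys, and how azbackup itself loads the keys is not shown in this section.

```python
import re

# Matches one complete PEM block; the label after BEGIN must
# reappear after END.
PEM_BLOCK = re.compile(
    r"-----BEGIN ([A-Z0-9 ]+)-----.*?-----END \1-----", re.DOTALL)


def split_pem_blocks(text):
    """Return each complete PEM block found in text, in file order."""
    return [match.group(0) for match in PEM_BLOCK.finditer(text)]


# Placeholder bodies standing in for the two stored keys.
sample = (
    "-----BEGIN RSA PRIVATE KEY-----\nAAAA\n-----END RSA PRIVATE KEY-----\n"
    "-----BEGIN RSA PRIVATE KEY-----\nBBBB\n-----END RSA PRIVATE KEY-----\n"
)

# The first block is the encryption key, the second the signing key.
encryption_pem, signing_pem = split_pem_blocks(sample)
```

Because the order is the only thing distinguishing the two keys, any code that reads the file has to agree with the writer on which key comes first.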
The actual work of generating an RSA public/private key pair is
done by generate_rsa_key_bio. This
makes a call to RSA.gen_key and
specifies a bit length of 2,048 and a public exponent of 65,537. (This
is an internal parameter used by the RSA algorithm, typically set to either 3 or
65,537. Note that using 3 here is considered weaker, which is why 65,537 is the default.)
The call to RSA.gen_key takes a
long time to complete. The callback function passed in here is
empty, but RSA.gen_key calls it
with a progress number that can be used to visually indicate
progress.
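A callback that actually shows progress could look something like the sketch below. The exact arguments RSA.gen_key passes are treated as opaque here (that part is an assumption, hence *args), and the helper name make_progress_callback is hypothetical. The demonstration simulates a few notifications instead of generating a real key.

```python
import io
import sys


def make_progress_callback(out=sys.stdout):
    """Build a callback that writes one dot per progress notification."""
    def callback(*args):
        # Ignore whatever progress values are passed; just show activity.
        out.write(".")
        out.flush()
    return callback


# Simulate three progress notifications without generating a key.
buffer = io.StringIO()
progress = make_progress_callback(buffer)
for step in range(3):
    progress(step, 0)
```

In generate_rsa_key_bio, the result of make_progress_callback() would be passed as the third argument to RSA.gen_key in place of the empty callback.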
Why does this take so long? Though this has a bit to do with the
complex math involved, most of the time goes into gathering entropy to
ensure that the key is sufficiently random. Surprisingly, this process
is sped up if there’s activity in the system. The OpenSSL command-line
tool asks people to generate keyboard/mouse/disk activity. The key
generation needs a source of random data (based on pure entropy), and
hardware events are a good source of entropy.
Once the key pair has been generated, it is written out in a
special encoded form (PEM) into the BIO
object passed in.
Note: If you plan to do this in a language other than Python, you
don’t have to worry. Everything discussed here is typically a standard
part of any mainstream framework. For .NET, use the RSACryptoServiceProvider class to generate RSA keys. Generating
the PEM format from .NET is a bit trickier because it isn’t supported
out of the box. Of course, you can choose to use some other format, or
invent one of your own. If you want to persist with PEM, a quick web
search shows up a lot of sample code to export keys in the PEM format.
You can also P/Invoke the CryptExportPKCS8Ex function in Crypt32.dll.
Thankfully, all of this work is hidden under the covers.
Generating a key file is quite simple. The following command generates a
key file containing the two RSA key pairs at d:\foo.key:
d:\book\code\azbackup>python azbackup-gen-key.py --keyfile d:\foo.key
Warning: Remember to keep this key file safely tucked away. If you lose
this key file, you can never decrypt your archives, and there is no
way to recover the data inside them.